library(tidyverse)
library(plotly)
library(scales)
library(foreign)
library(psych)
X2016 <- as_tibble(read.spss("C:/Users/User/OneDrive/Desktop/rmijankyal/Midtermproj/2016.sav", to.data.frame = TRUE))
histogram <- ggplot(X2016) + labs(y = "Frequency", x = "Expenditure")+ geom_histogram(aes(x = expend), fill = 'blue',bins = 100)
ggplotly(histogram)
density <- ggplot(data = X2016) +
labs(title = "Density of Expenditure & Totincome", y = "Density", x = "Expenditure, Totincome") +
geom_density(aes(x = expend, y = after_stat(density))) + geom_density(aes(x = totincome, y = after_stat(density)), color = "red")
ggplotly(density)
boxplot <- ggplot(data = X2016) + labs(title = "Boxplot of expenditures", y = "Expenditure") +
geom_boxplot(aes(y = expend))
ggplotly(boxplot)
The graphs show that the distribution of expenditures is not normal: it is positively skewed and has very high kurtosis, so most of the outliers lie on the right side of the distribution.
library(pastecs)
options(scipen = 100)
options(digits = 2)
stat.desc(X2016$expend)
## nbr.val nbr.null nbr.na min max
## 5184.00 0.00 0.00 8186.49 2923503.83
## range sum median mean SE.mean
## 2915317.34 742300247.45 116002.20 143190.63 1906.53
## CI.mean.0.95 var std.dev coef.var
## 3737.60 18843019184.87 137269.88 0.96
library(psych)
describe(X2016$expend)
## vars n mean sd median trimmed mad min max range skew
## X1 1 5184 143191 137270 116002 123986 72171 8186 2923504 2915317 8.5
## kurtosis se
## X1 131 1907
library(pastecs)
options(scipen = 100)
options(digits = 2)
stat.desc(X2016$totincome)
## nbr.val nbr.null nbr.na min max
## 5184.00 3.00 0.00 0.00 2370500.00
## range sum median mean SE.mean
## 2370500.00 1013060788.91 150267.07 195420.68 2476.32
## CI.mean.0.95 var std.dev coef.var
## 4854.63 31789126976.58 178295.06 0.91
library(psych)
describe(X2016$totincome)
## vars n mean sd median trimmed mad min max range skew
## X1 1 5184 195421 178295 150267 164570 106473 0 2370500 2370500 3.4
## kurtosis se
## X1 20 2476
- As the results show, the mean of total income (195421) is higher than the mean of expenditures (143191), which suggests that on average households in Armenia had savings in 2016. In my further research we will see, however, that in many cases a household's expenditures exceed its total income.
- Although the mean of total income is higher than that of expenditures, the range of expenditures (2915317) is larger than that of total income (2370500). Relative dispersion is also slightly higher for expenditures (coefficient of variation 0.96 versus 0.91), even though the standard deviation of totincome (178295) is larger in absolute terms than that of expenditures (137270).
- As the density graphs also show, the skewness of expenditures (8.5) exceeds that of totincome (3.4). Both are positively skewed, that is, both have a heavy right tail.
- The kurtosis of totincome (20) is much smaller than that of expenditures (131), so expenditures have far more extreme outliers. Kurtosis does not determine variance, however: the descriptive statistics show that the variance of totincome is about 1.7 times that of expenditures.
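The variance and dispersion comparison can be checked directly from the figures printed by stat.desc(). The following is only a minimal sketch: the numbers are copied by hand from the output above, and the variable names are illustrative rather than part of the original analysis.

```r
# Summary statistics copied from the stat.desc() output above
var_expend     <- 18843019184.87
var_totincome  <- 31789126976.58
sd_expend      <- 137269.88
sd_totincome   <- 178295.06
mean_expend    <- 143190.63
mean_totincome <- 195420.68

# How many times larger is the variance of totincome?
variance_ratio <- var_totincome / var_expend

# Coefficient of variation = standard deviation relative to the mean
cv_expend    <- sd_expend / mean_expend
cv_totincome <- sd_totincome / mean_totincome

round(variance_ratio, 2)
round(c(expend = cv_expend, totincome = cv_totincome), 2)
```

The coefficients of variation agree with the coef.var column of stat.desc(), which is a useful sanity check on the hand-copied values.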
\(H_0\): The average household expenditures in Armenia are equal to 195421 AMD (the mean of totincome).
\(H_1\): The average household expenditures in Armenia are not equal to 195421 AMD.
t.test(x = X2016$expend, mu = mean(X2016$totincome), alternative = "two.sided")
##
## One Sample t-test
##
## data: X2016$expend
## t = -27, df = 5183, p-value <0.0000000000000002
## alternative hypothesis: true mean is not equal to 195421
## 95 percent confidence interval:
## 139453 146928
## sample estimates:
## mean of x
## 143191
Since the p-value is very close to 0, we reject the null hypothesis: the average household expenditures are not equal to the mean of totincome.
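The one-sample test above can be reproduced by hand from the descriptive statistics, which shows where the t value and the confidence interval come from. This is a sketch only: the inputs are copied from the stat.desc() output, and the variable names are illustrative.

```r
# Values copied from the descriptive statistics above
mean_expend <- 143190.63   # sample mean of expend
se_expend   <- 1906.53     # standard error of the mean (SE.mean)
mu0         <- 195420.68   # hypothesised mean (mean of totincome)
n           <- 5184        # number of households

# t statistic: distance between sample mean and mu0 in standard errors
t_stat <- (mean_expend - mu0) / se_expend

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# 95% confidence interval for the mean of expend
ci <- mean_expend + c(-1, 1) * qt(0.975, df = n - 1) * se_expend

round(t_stat, 1)
round(ci)
```

The interval should agree with the one printed by t.test() up to rounding of the copied inputs.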
\(H_0\): The average household expenditures in different types of settlements in Armenia are equal.
\(H_1\): The average household expenditures in different types of settlements in Armenia are not equal.
X2016<-X2016%>%mutate(sett_new=case_when((settlement=="Yerevan"|settlement=="other urban")~"urban", settlement=="rural"~"rural"))
table(X2016$settlement, X2016$sett_new)
##
## rural urban
## Yerevan 0 1404
## other urban 0 1836
## rural 1944 0
t.test(X2016$expend~X2016$sett_new)
##
## Welch Two Sample t-test
##
## data: X2016$expend by X2016$sett_new
## t = -4, df = 5058, p-value = 0.0002
## alternative hypothesis: true difference in means between group rural and group urban is not equal to 0
## 95 percent confidence interval:
## -20353 -6189
## sample estimates:
## mean in group rural mean in group urban
## 134896 148167
Since the p-value of the test (0.0002) is lower than 0.01, we reject the null hypothesis: average household expenditures differ between rural and urban settlements.
\(H_0\): The average household expenditures and the average income (the monincome variable) in Armenia are equal.
\(H_1\): The average household expenditures and the average income (the monincome variable) in Armenia are not equal.
t.test(x = X2016$monincome, y =X2016$expend)
##
## Welch Two Sample t-test
##
## data: X2016$monincome and X2016$expend
## t = 13, df = 9766, p-value <0.0000000000000002
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 33081 45270
## sample estimates:
## mean of x mean of y
## 182366 143191
Since the p-value is very close to 0, we reject the null hypothesis: the average household expenditures and the average income in Armenia are NOT equal. Note that this test uses monincome (mean 182366) rather than totincome (mean 195421). Economically, it means that households in Armenia either had savings or had extra costs associated with the interest on previously taken loans, which is more realistic than the first assumption.
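Because income and expenditures are presumably measured on the same households, a paired t-test may be a more natural choice than the Welch test, which treats the two columns as independent samples. The sketch below uses simulated data, not the survey data, and assumes one row per household; all variable names and distribution parameters are made up for illustration.

```r
set.seed(1)

# Simulated stand-ins for household income and expenditures (NOT survey data)
n      <- 200
income <- rlnorm(n, meanlog = 12, sdlog = 0.6)
expend <- income * runif(n, min = 0.5, max = 1.1)

# Paired test: works on the within-household differences income - expend
paired <- t.test(x = income, y = expend, paired = TRUE)

# Welch test on the same data, for comparison
welch <- t.test(x = income, y = expend)

paired$parameter   # degrees of freedom: n - 1
paired$estimate    # mean of the within-household differences
welch$parameter    # Welch-Satterthwaite degrees of freedom
```

When the two measurements are positively correlated within households, the paired test typically has a smaller standard error than the Welch test.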
X2016 <- X2016 %>%
  mutate(expendLevels = case_when(
    expend <= 57000 ~ "Very low",
    expend <= 89000 ~ "Low",
    expend <= 300000 ~ "Medium",
    expend <= 500000 ~ "High",
    expend > 500000 ~ "Very high"
  ))
X2016$expendLevels <- factor(X2016$expendLevels, ordered = TRUE, levels = c("Very low", "Low", "Medium", "High", "Very high"))
\(H_0\): Expenditure levels and marital status are independent.
\(H_1\): Expenditure levels and marital status are not independent.
chisq.test(X2016$headmerstatus, X2016$expendLevels)
##
## Pearson's Chi-squared test
##
## data: X2016$headmerstatus and X2016$expendLevels
## X-squared = 632, df = 16, p-value <0.0000000000000002
Since the p-value is almost 0 and well below 0.05, we reject the null hypothesis. In other words, expenditure levels and marital status are NOT independent. Economically, this suggests that marital status is associated with the level of expenditures, in one direction or the other. We may assume, for example, that those who are married and have children have much higher expenditures related to family expenses.
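With more than five thousand observations, even a weak association produces a tiny p-value, so it can help to look at an effect size such as Cramér's V. The sketch below computes it from the reported chi-squared statistic; the 5 x 5 table size is inferred from df = 16 and the five expenditure levels, and the variable names are illustrative.

```r
# Values taken from the chisq.test() output above
chi_sq <- 632    # X-squared
n      <- 5184   # number of households
n_rows <- 5      # marital status categories (inferred from df = 16)
n_cols <- 5      # expenditure levels

# Consistency check: df = (rows - 1) * (cols - 1)
(n_rows - 1) * (n_cols - 1)

# Cramer's V: 0 means no association, 1 means perfect association
cramers_v <- sqrt(chi_sq / (n * min(n_rows - 1, n_cols - 1)))
round(cramers_v, 2)
```

A value in this range is usually read as a weak to moderate association, despite the very small p-value.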
\(H_0\): The mean expenditures in the groups of “Marital Status” are equal.
\(H_1\): At least one mean is different.
summary(aov(data = X2016,expend ~ headmerstatus))
## Df Sum Sq Mean Sq F value Pr(>F)
## headmerstatus 4 2398958537366 599739634342 32.6 <0.0000000000000002 ***
## Residuals 5179 95264409897838 18394363757
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As the p-value of the test is almost 0 (F = 32.6), we reject the null hypothesis: at least one mean in the groups of “Marital Status” differs from the others. Economically, this test and the previous one lead to the same conclusion.
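The F value and the share of variance explained by the grouping can be recomputed from the sums of squares in the ANOVA table. This is a sketch with values copied by hand from the output above; the variable names are illustrative.

```r
# Values copied from the aov() summary above
ss_between  <- 2398958537366    # Sum Sq for headmerstatus
ss_residual <- 95264409897838   # Sum Sq for Residuals
df_between  <- 4
df_residual <- 5179

# F = mean square between groups / mean square within groups
f_value <- (ss_between / df_between) / (ss_residual / df_residual)

# p-value: upper tail of the F distribution
p_value <- pf(f_value, df_between, df_residual, lower.tail = FALSE)

# Eta squared: share of total variation explained by the grouping
eta_sq <- ss_between / (ss_between + ss_residual)

round(f_value, 1)
round(eta_sq, 3)
```

Eta squared complements the p-value: it shows how much of the variation in expenditures the grouping accounts for, which can be small even when the test is highly significant.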
marzbox <- ggplot() + labs(title = "Expenditures by Marz", y = "Expenditures", x = "Marz") +
geom_boxplot(data = X2016, aes(y = expend, x = marz, color = marz))
ggplotly(marzbox)